Assessing the properties of patient-specific treatment effect estimates from causal forest algorithms under essential heterogeneity

Background Treatment variation from observational data has been used to estimate patient-specific treatment effects. Causal Forest Algorithms (CFAs) developed for this task have unknown properties when treatment effect heterogeneity from unmeasured patient factors influences treatment choice – essential heterogeneity. Methods We simulated eleven populations with identical treatment effect distributions based on patient factors. The populations varied in the extent that treatment effect heterogeneity influenced treatment choice. We used the generalized random forest application (CFA-GRF) to estimate patient-specific treatment effects for each population. Average differences between true and estimated effects for patient subsets were evaluated. Results CFA-GRF performed well across the population when treatment effect heterogeneity did not influence treatment choice. Under essential heterogeneity, however, CFA-GRF yielded treatment effect estimates that reflected true treatment effects only for treated patients and were on average greater than true treatment effects for untreated patients. Conclusions Patient-specific estimates produced by CFAs are sensitive to why patients in real-world practice make different treatment choices. Researchers using CFAs should develop conceptual frameworks of treatment choice prior to estimation to guide estimate interpretation ex post. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-024-02187-5.


Introduction
Developing patient-specific treatment effect evidence to guide individualized treatment decision-making is a cornerstone of patient-centered care [1-3]. The need for patient-specific evidence follows from the acknowledged breadth of outcome variation across patients receiving the same treatment [4-10]. This phenomenon is known as treatment effect heterogeneity and is defined as "nonrandom variation in the direction of magnitude of a treatment effect" [11]. With their restrictive inclusion/exclusion criteria, randomized controlled trials (RCTs) cannot generate appropriate patient-specific evidence for many patients [4, 11-14]. As an alternative, observational data provide treatment variation within the context of real-world practice and a diversity of patients well beyond those evaluated in RCTs [2, 3, 12, 15, 16]. The traditional approach to estimating patient-specific treatment effects using observational data is to use parametric estimators and assign to each patient an estimated treatment effect from a "reference class" of patients [17-22]. Reference classes are defined a priori by the researcher based on combinations of measured patient factors that are conceptually associated with treatment effect heterogeneity [17-22]. The need to specify reference classes a priori has been described as "the central problem when using group evidence to forecast outcomes (or treatment effects) in individuals" [18]. Even with a small number of measured patient factors, a patient could be placed in many reference classes, leaving it unclear which class is best aligned to the patient [10, 17, 18].

Methodological background
Assigning patients to appropriate reference classes using observational data, either a priori with parametric estimators or ex post through a CFA, does not ensure that the resulting treatment effect estimates are appropriate for each patient. The conventional criticism of using observational data to estimate treatment effects is the risk of omitted variable bias, in which unmeasured factors with direct effects on study outcomes are distributed differently between treated and untreated patients [52]. However, even if patients are assigned to appropriate reference classes and the risk of omitted variable bias is mitigated through study design, a single treatment effect estimate for a reference class may not be appropriate for each patient within the class. The econometric literature has shown that parametric estimators yield average treatment effect estimates for patient subsets based on treatment choice. Under the assumption of no omitted variable bias, regression-based estimators yield unbiased estimates of the average treatment effect for the subset of patients who chose treatment, known as the average treatment effect on the treated (ATT) [43, 48-50, 54, 57, 60, 68, 69]. Consequently, if treatment choice in an empirical setting was influenced by unmeasured patient factors related to treatment effectiveness (essential heterogeneity), the parametric estimate of ATT for a reference class will overstate the true treatment effects for the untreated patients in the class [39, 49, 50, 70]. Researchers using parametric estimators have learned not to generalize a single parametric treatment effect estimate to all patients in a population [38, 43, 47-51, 53, 55, 56, 58, 59, 61, 67, 70, 71].
In contrast, the properties of estimated patient-specific treatment effects from CFAs under essential heterogeneity have not been explored. Simulation research has demonstrated that CFAs accurately yield patient-specific treatment effects under the broad condition of ignorability [24, 26-29, 34-36]. Ignorability assumes that omitted variable bias does not exist within an empirical setting. However, ignorability also assumes that essential heterogeneity does not exist. These dual assumptions can be described using potential outcome notation. Define $Y_{1i}$ and $Y_{0i}$ as the potential outcomes for patient "i" when treated and untreated, respectively, so that $(Y_{1i} - Y_{0i})$ is the true potential treatment effect for patient "i". Define $T_i$ as the observed treatment choice for patient "i" and $X_i$ as the set of measured patient factors available to the researcher. Ignorability is broadly defined as $(Y_{1i}, Y_{0i}) \perp T_i \mid X_i$; that is, conditional on $X_i$, treatment choice is independent of both potential patient outcomes [72]. As such, ignorability implies two distinct assumptions. The first is:

$$Y_{0i} \perp T_i \mid X_i \tag{I.1}$$

Assumption (I.1) says that, within a reference class of patients based on $X_i$, treatment choice is unrelated to untreated potential outcomes across patients. Stated differently, treatment choice is unrelated to unmeasured patient factors associated with $Y_{0i}$. Assuming (I.1) eliminates the risk of omitted variable bias in an observational study [52].
Even if assumption (I.1) is true, though, treatment effects may remain heterogeneous within a reference class defined by $X_i$. With respect to this heterogeneity, ignorability further assumes:

$$(Y_{1i} - Y_{0i}) \perp T_i \mid X_i \tag{I.2}$$

Assumption (I.2) says that, within a reference class of patients defined by $X_i$, treatment choice within the class is not influenced by unmeasured patient factors associated with treatment effectiveness; that is, there is no essential heterogeneity [38, 39, 45]. If ignorability holds within a reference class defined by $X_i$, only the treatment variation that stems from patient factors unrelated to treatment effectiveness will be used to estimate treatment effects within the class. Consequently, CFA simulation results that assume ignorability provide no guidance on the properties of patient-specific treatment effect estimates in real-world scenarios in which essential heterogeneity is thought to exist a priori. For example, the effectiveness of surgery for patients with shoulder fractures is thought to vary with fracture complexity and patient resiliency, which in turn influence surgery choice [73-77], but fracture complexity and patient resiliency are not measurable in large observational databases such as Medicare claims data [73-77]. A study using a causal forest algorithm to estimate patient-specific surgery effects using Medicare claims data theorized a priori that the resulting estimates should be interpreted in terms of essential heterogeneity, but evidence was not available to guide these interpretations [78]. In addition, understanding the influence of essential heterogeneity on CFA estimates is especially relevant to researchers proposing to use CFAs in effectiveness-implementation hybrid study designs, in which the promotion of a treatment is randomized to satisfy assumption (I.1) but decision makers still have the discretion to choose among available treatments based on individual patient factors [79-95].
To provide this guidance, this study modified a treatment choice-based simulation method used in previous research to assess the impact of essential heterogeneity on patient-specific treatment effect estimates from a CFA estimator [43, 48, 53]. Eleven patient populations were simulated with the same distribution of true treatment effects drawn from identical distributions of simulated patient factors. All eleven simulations were specified to satisfy assumption (I.1). The simulations varied by plausible differences in the extent to which knowledge of true patient-specific treatment effects influenced treatment choice. We used the causal forest algorithm within the generalized random forests application (CFA-GRF) [24-26, 96, 97] to estimate patient-specific treatment effects for each simulated population. CFA-GRF has been singled out as the most appropriate CFA for estimating patient-specific treatment effects [98]. To tease out the influence of essential heterogeneity, we applied CFA-GRF to each simulated population under conditions of (1) fully observed heterogeneity, in which all patient factors associated with treatment effect heterogeneity are observed by the researcher, and (2) partially observed heterogeneity, in which only a subset of the patient factors associated with treatment effect heterogeneity are observed by the researcher. Patient-specific treatment effect estimates from CFA-GRF were used to calculate the average absolute and average percentage differences between true and estimated effects for each simulated population and for treatment choice-based population subsets.

Simulation model
Our simulation model follows the general framework in the essential heterogeneity literature [39, 43, 45, 48, 53, 99]. Figure 1 contains a directed acyclic graph (DAG) illustrating the conceptual framework of treatment effect heterogeneity, treatment choice, and outcome within our simulations. Figure 1 was adapted from standard DAG approaches to reflect patient factors affecting treatment effectiveness and the treatment effect knowledge of the decision maker [100, 101]. Outcome ($Y_i$) equals 1 if patient "i" is cured of the medical condition, and 0 if not cured. $P(Y_i \mid T_i, S_i)$ is the probability of cure for patient "i" conditional on treatment choice ($T_i$) and patient severity ($S_i$). Patient cure probability also varies with accumulated other factors ($W_i$). Treatment ($T_i$) equals 1 if the patient receives treatment and 0 otherwise, which we designate as watchful waiting. In all simulations, the true absolute treatment effect for each patient "i" ($TE_i$) on $Y_i$ relative to watchful waiting varies with six factors $X_{1i}$, $X_{2i}$, $X_{3i}$, $X_{4i}$, $X_{5i}$, and $X_{6i}$ based on the following equation:

$$TE_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \beta_5 X_{5i} + \beta_6 X_{6i} \tag{1}$$

$X_{1i}$, $X_{2i}$, $X_{3i}$, $X_{4i}$, $X_{5i}$, and $X_{6i}$ are binary variables distributed Bernoulli for each patient with a probability of 0.5. Each $\beta_x$ equals the absolute change in treatment effect if a patient has condition "X" ($\beta_1 = 0.024$, $\beta_2 = 0.048$, $\beta_3 = 0.071$, $\beta_4 = 0.095$, $\beta_5 = 0.119$, $\beta_6 = 0.143$). With these parameter values, simulated patients have true treatment effects ranging from 0 to 0.5 with an average true treatment effect of 0.25 for each simulated population. For example, if the simulated patient factors for patient "i" ($X_{1i}, X_{2i}, X_{3i}, X_{4i}, X_{5i}, X_{6i}$) were (1, 0, 1, 0, 1, 0), then the true $TE_i$ for patient "i" was 0.214 = (0.024 + 0 + 0.071 + 0 + 0.119 + 0). Figure 2 illustrates the identical distribution of true treatment effects in each simulated population.

The true cure probability relationship for each simulated patient "i" signified by the red arrows in Fig. 1 is as follows:

$$P(Y_i \mid T_i, S_i) = \alpha_0 + \alpha_S S_i + TE_i \cdot T_i$$

$\alpha_0$ equals the untreated patient cure probability at the mean severity level and was set to 0.1 in all simulations. Patient severity ($S_i$) was specified as a uniformly distributed random variable from -0.5 to 0.5. $\alpha_S$ equals the change in untreated patient cure probability for differences in severity level and was set to -0.1 in all simulations. As a result, in each simulated population, watchful waiting patients ($T_i = 0$) had a cure probability ranging from 0.05 to 0.15. Treated patients ($T_i = 1$) had a cure probability ranging from 0.05 to 0.65. All other unmeasured patient factors impacting the probability of a cure are found in $W_i$.
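The patient-generation step described above can be sketched in a few lines. This is a minimal illustration in Python with NumPy, which is an assumption of this example (the study reports SAS procedures and the R grf package for its analyses). The linear cure probability used here, alpha_0 + alpha_S * S_i + TE_i * T_i, is inferred from the stated parameter values and probability ranges rather than quoted, and the function names are hypothetical.

```python
import numpy as np

# Parameter values quoted in the text.
BETAS = np.array([0.024, 0.048, 0.071, 0.095, 0.119, 0.143])
ALPHA_0 = 0.1   # untreated cure probability at mean severity
ALPHA_S = -0.1  # change in untreated cure probability per unit of severity


def simulate_patients(n, seed=0):
    """Draw patient factors, true treatment effects, and severity."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, 0.5, size=(n, 6))   # X_1i..X_6i ~ Bernoulli(0.5)
    te = x @ BETAS                          # true treatment effect TE_i
    s = rng.uniform(-0.5, 0.5, size=n)      # severity S_i
    return x, te, s


def cure_probability(te, s, t):
    """Assumed linear form: alpha_0 + alpha_S * S_i + TE_i * T_i."""
    return ALPHA_0 + ALPHA_S * s + te * t


x, te, s = simulate_patients(50_000)
p_untreated = cure_probability(te, s, t=0)
p_treated = cure_probability(te, s, t=1)

# Worked example from the text: factors (1, 0, 1, 0, 1, 0).
example_te = np.array([1, 0, 1, 0, 1, 0]) @ BETAS
print(round(float(example_te), 3))
print(round(float(te.mean()), 3), round(float(te.max()), 3))
print(round(float(p_untreated.min()), 3), round(float(p_treated.max()), 3))
```

Under these assumptions the simulated effects should average close to 0.25, and the cure probabilities should stay within the ranges stated in the text (0.05 to 0.15 untreated, 0.05 to 0.65 treated).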
The green arrows in Fig. 1 describe the treatment choice process that varied across the eleven simulations. In each simulation, it is assumed that the treatment decision-maker observes $X_{1i}$, $X_{2i}$, $X_{3i}$, $X_{4i}$, $X_{5i}$, and $X_{6i}$ and forms an expected treatment effect for patient "i". The simulations differ by the knowledge available to decision makers of the relationship between the six patient factors and treatment effectiveness, as represented by the expected treatment effect function for simulation "j":

$$ETE_{ij}(X_{1i}, X_{2i}, X_{3i}, X_{4i}, X_{5i}, X_{6i}, K_j) = K_j \cdot (TE_i - 0.25) + 0.25 \tag{3}$$

$K_j \in \{0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1\}$ is the proportion of patient-specific $TE_i$ knowledge used by decision makers in simulation "j" that is distinct from the average population treatment effect. Decision makers are more aware of each patient's true treatment effect relative to the average population treatment effect as $K_j$ increases from 0 to 1 across simulations. For example, in the simulation in which $K_j = 0$, decision makers only have knowledge of the average treatment effect across the population (0.25) when making treatment decisions for each patient. Alternatively, when $K_j = 1$, decision makers have exact knowledge of the treatment effect for patient "i" from observed $X_{1i}$, $X_{2i}$, $X_{3i}$, $X_{4i}$, $X_{5i}$, and $X_{6i}$. $ETE_{ij}(X_{1i}, X_{2i}, X_{3i}, X_{4i}, X_{5i}, X_{6i}, K_j)$ is used to calculate the expected value of treatment for patient "i" ($EVT_i$) based on the following:

$$EVT_i = V \cdot ETE_{ij}(X_{1i}, X_{2i}, X_{3i}, X_{4i}, X_{5i}, X_{6i}, K_j) - C + U_i \tag{2}$$

$EVT_i$ sums the expected benefits and detriments (e.g., costs) of treatment relative to watchful waiting for patient "i", conditional on knowledge $K_j$, the patient factors $X_{1i}$, $X_{2i}$, $X_{3i}$, $X_{4i}$, $X_{5i}$, $X_{6i}$, direct treatment cost $C$, cure value $V$, and $U_i$, the other accumulated factors affecting treatment value, which are independent of treatment effectiveness for patient "i". $ETE_{ij}(X_{1i}, X_{2i}, X_{3i}, X_{4i}, X_{5i}, X_{6i}, K_j)$ equals the decision maker's expected change in cure probability from treatment. To focus this study on the impact of essential heterogeneity across simulations, all patients were assigned a cure value $V$ of \$800 and a treatment cost $C$ of \$200. These values were chosen because they yield simulated population treatment percentages of approximately 50%. $V$ designations of \$500 and \$1100 were also tried, which yielded different population treatment percentages but did not influence the interpretation of our results relative to essential heterogeneity. $U_i$ is the source of treatment valuation that varies across patients, is unrelated to treatment effectiveness, and is unmeasured by the researcher. $U_i$ values were assigned to patients from a normal distribution with a mean of zero and a common variance $\sigma^2_U$ across simulations. Furthermore, in all simulations, $U_i$ was specified independently of $W_i$, so that the unmeasured factors influencing treatment choice had no relationship with the unmeasured factors directly affecting cure and ignorability assumption (I.1) was satisfied.
In all simulations, decision makers chose treatment for patient "i" if $EVT_i$ was positive and watchful waiting if $EVT_i$ was negative. In the simulation in which the knowledge of patient-specific treatment effect heterogeneity is zero ($K_j = 0$), only variation in $U_i$ leads to different treatment choices across simulated patients. As $K_j$ increases across simulations, a larger proportion of the variation in treatment choice is attributable to treatment effectiveness, or sorting on the gain. Once a treatment was chosen for each patient, cure ($Y_i$) was simulated using a Bernoulli function of $P(Y_i \mid T_i, S_i)$ for patient "i", given $T_i$ and $S_i$. Table 1 summarizes the model parameters and values used in the simulations.
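The treatment choice process described above can be illustrated with a short simulation. This is a sketch under stated assumptions, not the study's code: it is written in Python with NumPy, it assumes the expected value of treatment takes the form V * ETE - C + U_i and the cure probability takes the linear form 0.1 - 0.1 * S_i + TE_i * T_i, and it sets SIGMA_U to a hypothetical value of 50 because the text does not report the standard deviation of U_i.

```python
import numpy as np

BETAS = np.array([0.024, 0.048, 0.071, 0.095, 0.119, 0.143])
V, C = 800.0, 200.0   # cure value and treatment cost from the text
SIGMA_U = 50.0        # hypothetical: the text does not report sigma_U


def simulate_choice(k, n=50_000, seed=0):
    """Simulate treatment choice and cure for one knowledge level K_j = k."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, 0.5, size=(n, 6))
    te = x @ BETAS
    s = rng.uniform(-0.5, 0.5, size=n)
    u = rng.normal(0.0, SIGMA_U, size=n)     # unmeasured, unrelated to TE_i
    ete = k * (te - 0.25) + 0.25             # expected treatment effect
    evt = V * ete - C + u                    # assumed form of EVT_i
    t = (evt > 0).astype(int)                # treat when EVT_i is positive
    p_cure = 0.1 - 0.1 * s + te * t          # assumed linear cure probability
    y = rng.binomial(1, p_cure)
    return {"x": x, "te": te, "s": s, "t": t, "y": y}


# Share treated and average true effect among treated and untreated patients.
for k in (0.0, 0.5, 1.0):
    d = simulate_choice(k)
    treated = d["t"] == 1
    print(k, round(float(treated.mean()), 3),
          round(float(d["te"][treated].mean()), 3),
          round(float(d["te"][~treated].mean()), 3))
```

With k = 0 only U_i drives choice, so treated and untreated patients should have similar average true effects; as k rises, patients with larger true effects are more likely to be treated, which is the sorting on the gain described in the text. The size of the gap depends on the assumed SIGMA_U.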
To support large sample properties, we generated 50,000 patients in each simulation. The blue arrows in Fig. 1 describe the variables observed by the researcher after each simulation. By varying the knowledge of $TE_i$ across simulations with $K_j$ and the patient factors observed by the researcher, we can tease out the impacts of essential heterogeneity on patient-specific treatment effect estimates. In each scenario, researchers observe $T_i$, $Y_i$, and $S_i$. We designate "fully observed heterogeneity" as the empirical condition in which researchers observe all six patient factors $X_{1i}$, $X_{2i}$, $X_{3i}$, $X_{4i}$, $X_{5i}$, and $X_{6i}$. We designate "partially observed heterogeneity" as the empirical condition in which researchers observe only $X_{1i}$, $X_{2i}$, $X_{3i}$, and $X_{4i}$. Under fully observed heterogeneity, treatment effects are homogeneous within each reference class spanned by combinations of the complete set of patient factors. When $K_j = 0$, decision-makers are not knowledgeable of the sources of treatment effect heterogeneity, and treatment choice varies only with $U_i$. Under fully observed heterogeneity with $K_j > 0$, decision-makers are at least partly knowledgeable of the sources of treatment effect heterogeneity, with the effect of this knowledge on treatment choice increasing with $K_j$. Under partially observed heterogeneity, treatment effects are heterogeneous within the reference classes defined by the observed set of patient factors.
Partially observed heterogeneity with $K_j = 0$ has been dubbed nonessential heterogeneity in the econometric literature [38, 39]. Under nonessential heterogeneity, treatment choice is not influenced by the unmeasured patient factors affecting treatment effectiveness within a reference class. Scenarios with partially observed heterogeneity and $K_j > 0$ represent essential heterogeneity. In these scenarios, treatment effects are heterogeneous within each reference class, with the influence of treatment effect heterogeneity on treatment choice increasing with $K_j$ across simulations.
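To make the within-class heterogeneity concrete, the following sketch enumerates the true treatment effects that are possible inside one reference class when only the first four patient factors are observed. It is written in Python with NumPy as an illustrative assumption, uses the coefficient values given in the simulation model, and the function name is hypothetical.

```python
import itertools

import numpy as np

BETAS = np.array([0.024, 0.048, 0.071, 0.095, 0.119, 0.143])


def within_class_effects(x_observed):
    """True effects possible for one observed (X1..X4) pattern."""
    base = float(np.dot(x_observed, BETAS[:4]))
    # The unobserved factors X5 and X6 each take the values 0 or 1.
    return sorted(float(base + x5 * BETAS[4] + x6 * BETAS[5])
                  for x5, x6 in itertools.product((0, 1), repeat=2))


effects = within_class_effects([1, 0, 1, 0])
print([round(e, 3) for e in effects])      # four equally likely true effects
print(round(float(np.mean(effects)), 3))   # class-average true effect
```

Because X5 and X6 are each 1 with probability 0.5, the four values are equally likely, and the spread between the smallest and largest effect in any class is 0.119 + 0.143 = 0.262. Whether treated and untreated patients within the class differ systematically across these four values is what separates nonessential from essential heterogeneity.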

Simulated population summaries
Treatment effect estimation using observational data requires what is called a common area of support, or overlap, between treated and untreated patients; that is, patients with the same measured patient factors must be observed to make different treatment choices [102, 103]. It has been shown that including patients in study populations with insufficient overlap can lead to biased treatment effect estimates [104, 105]. The treatment choice-based simulations used here naturally reduce overlap the more that treatment choice is influenced by patient factors affecting treatment effectiveness. To monitor this influence across simulations, we used the SAS PROC LOGISTIC procedure to estimate the treatment propensity score for each patient in each simulated population under both "fully observed heterogeneity" and "partially observed heterogeneity". Each simulated patient was then designated into either the "overlapped" subset with a propensity score between 0.05 and 0.95 or into the nonoverlapped subset with propensity scores either less than 0.05 or greater than 0.95 [104, 105]. We then estimated the percentage of patients in each simulated population who were treated, untreated, overlapped and treated, overlapped and untreated, nonoverlapped and treated, and nonoverlapped and untreated and then calculated the true average $TE_i$ in each subset. Next, for each simulated population, we estimated a linear probability model (LPM) of treatment choice $T_i$ on true $TE_i$ using the SAS PROC REG procedure with the SCORR1 option. This procedure provides the percentage of treatment choice variation within the simulated population that is attributable to variation in the true treatment effect, which serves as a measure of the influence of the true treatment effect on treatment choice. Last, we estimated the effect of $T_i$ and $S_i$ on $Y_i$ using an LPM in each simulated population. The parametric treatment effect literature states that the LPM estimator of the parameter on $T_i$ will yield a consistent estimate of the average absolute treatment effect on the treated in each simulated population [43, 48-50, 54, 57, 60, 68, 69].

Table 1 Model parameters and values used in the simulations

| Parameter | Description | Value and distribution |
|---|---|---|
| $\beta_1$ | Absolute increase in treatment effect on cure when $X_1 = 1$ | 0.024 |
| $\beta_2$ | Absolute increase in treatment effect on cure when $X_2 = 1$ | 0.048 |
| $\beta_3$ | Absolute increase in treatment effect on cure when $X_3 = 1$ | 0.071 |
| $\beta_4$ | Absolute increase in treatment effect on cure when $X_4 = 1$ | 0.095 |
| $\beta_5$ | Absolute increase in treatment effect on cure when $X_5 = 1$ | 0.119 |
| $\beta_6$ | Absolute increase in treatment effect on cure when $X_6 = 1$ | 0.143 |
| $TE_i$ | True treatment effect on outcome for patient "i" as a function of $X_{1i}$, $X_{2i}$, $X_{3i}$, $X_{4i}$, $X_{5i}$, $X_{6i}$ | Ranges from 0 to 0.5; distribution in Fig. 2 |
| $S_i$ | Severity level of patient "i", which directly affects cure but has no effect on treatment effectiveness and is unrelated to treatment choice | Uniform(-0.5, 0.5) |
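The population summaries described above (overlap status, the share of treatment choice variation explained by the true effect, and the LPM treatment coefficient) can be sketched as follows. This is an illustration in Python with NumPy, not a reproduction of the SAS procedures. Two techniques are swapped in and should be read as assumptions: the propensity score is estimated as the share treated within each combination of observed binary factors rather than by logistic regression, and the explained share of choice variation is computed as the squared correlation between T_i and TE_i, which equals the R-squared of a one-regressor LPM. The data-generating function repeats the assumed model forms, with a hypothetical sigma_u of 50.

```python
import numpy as np

BETAS = np.array([0.024, 0.048, 0.071, 0.095, 0.119, 0.143])


def simulate(k, n=50_000, sigma_u=50.0, seed=0):
    """Compact version of the choice model; sigma_u is a hypothetical value."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, 0.5, size=(n, 6))
    te = x @ BETAS
    s = rng.uniform(-0.5, 0.5, size=n)
    evt = 800.0 * (k * (te - 0.25) + 0.25) - 200.0 + rng.normal(0, sigma_u, n)
    t = (evt > 0).astype(int)
    y = rng.binomial(1, 0.1 - 0.1 * s + te * t)
    return x, te, s, t, y


def cell_propensity(x_obs, t):
    """Share treated within each combination of the observed binary factors."""
    cell = x_obs @ (2 ** np.arange(x_obs.shape[1]))   # integer id per cell
    n_cell = np.bincount(cell)
    n_treated = np.bincount(cell, weights=t)
    return (n_treated / np.maximum(n_cell, 1))[cell]


def ols(columns, y):
    """Least-squares coefficients with an intercept in position 0."""
    design = np.column_stack([np.ones(len(y)), *columns])
    return np.linalg.lstsq(design, y, rcond=None)[0]


x, te, s, t, y = simulate(k=1.0)
for label, x_obs in (("full", x), ("partial", x[:, :4])):
    ps = cell_propensity(x_obs, t)
    overlapped = (ps > 0.05) & (ps < 0.95)
    print(label, round(100 * float(overlapped.mean()), 1))

r2_choice = np.corrcoef(t, te)[0, 1] ** 2   # R^2 of the LPM of T_i on TE_i
lpm_effect = ols([t, s], y)[1]              # LPM coefficient on T_i
att = te[t == 1].mean()                     # true average effect on treated
print(round(float(r2_choice), 3), round(float(lpm_effect), 3),
      round(float(att), 3))
```

In this sketch the LPM coefficient on T_i should land close to the true average effect among treated patients, and overlap should be at least as high when only four factors enter the propensity score as when all six do. The exact percentages depend on the assumed sigma_u and will not match the study's tables.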

Causal forest algorithm
We then applied CFA-GRF [24-26, 96, 97] using the "grf" package in R [106] to estimate treatment effects for each patient in each simulated population. CFA-GRF evolved from standard classification and regression tree (CART) and random forest ensemble methods [24-26, 96, 97]. CART procedures iteratively partition "nodes" of observations within a population into subnodes or "branches" based on measured factors in a manner that maximizes the differences in an outcome across possible branches [97]. A tree is formed by viewing all of the subsequent branches of the study population. The final subnode, or leaf, on the end of a branch can be thought of as an algorithm-generated ex post reference class for observations with factors matching the leaf. The random forest approach is an ensemble method that generates a "forest" of CART trees through resampling from the study population [96]. The estimated outcome for a single observation is the average outcome across the leaves in the trees in the forest containing that observation. CFA-GRF extends the random forest approach to the goal of estimating the causal effect of a predictor of interest (e.g., a treatment) on an outcome. CFA-GRF partitions observations based on measured factors in a manner that maximizes the expected differences in the estimated treatment effect on an outcome [24-26]. For each simulated population, CFA-GRF was run using 4000 trees, minimum leaf sizes of 50, and the "honest" approach suggested by the algorithm creators, in which trees were estimated using a randomly selected 25% of the simulated population [26]. We ran CFA-GRF specifying $X_{1i}$, $X_{2i}$, $X_{3i}$, $X_{4i}$, $X_{5i}$, $X_{6i}$, and $S_i$ in the "fully observed heterogeneity" specification and $X_{1i}$, $X_{2i}$, $X_{3i}$, $X_{4i}$, and $S_i$ in the "partially observed heterogeneity" specification. As a result, each patient in each simulated population had two treatment effect estimates. We assessed the properties of these estimates by evaluating their ability to identify average treatment effect parameters for each simulated population and treatment choice-based subsets of the population. We calculated the average absolute and percentage difference between the true treatment effect for each simulated patient ($TE_i$) and estimated treatment effects for the full population and subsets of the population based on treatment choice and propensity score "overlap" status.
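The evaluation step can be sketched as a small helper that compares true and estimated effects for any patient subset. This is an illustrative Python example, not the study's code. It assumes that the "average absolute difference" is the mean of (estimate minus truth) on the absolute probability scale, so it keeps its sign, and that the percentage difference divides this by the subset's average true effect. The function name and the toy input values are hypothetical.

```python
import numpy as np


def difference_summary(te_true, te_hat, mask=None):
    """Average difference (estimate minus truth) and its percentage form."""
    if mask is None:
        mask = np.ones(len(te_true), dtype=bool)
    avg_diff = float(np.mean(te_hat[mask] - te_true[mask]))
    pct_diff = 100.0 * avg_diff / float(np.mean(te_true[mask]))
    return avg_diff, pct_diff


# Toy inputs (hypothetical values, not results from the study).
te_true = np.array([0.10, 0.20, 0.30, 0.40])
te_hat = np.array([0.12, 0.22, 0.27, 0.37])
treated = np.array([False, False, True, True])

for label, mask in (("all", None), ("treated", treated),
                    ("untreated", ~treated)):
    avg_diff, pct_diff = difference_summary(te_true, te_hat, mask)
    print(label, round(avg_diff, 4), round(pct_diff, 2))
```

In the toy data the estimates are too low for treated patients and too high for untreated patients, so the two subset differences have opposite signs and partly cancel in the full-population figure. This is why subset-level summaries are reported alongside the population-level one.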

Summary information across simulated populations
Table 2 summarizes each simulated population. Column A in Table 2 shows the proportion of treatment effect expectations ($ETE_i$) shaped by the true effect for each patient ($TE_i$) in each simulation, that is, $K_j$ from Eq. (3). Column B shows the percentage of treatment choice variation in each simulation explained by $TE_i$. Columns C and D show the percentage of simulated patients who overlapped, or had propensity scores greater than 0.05 and less than 0.95, in the fully observed heterogeneity and partially observed heterogeneity scenarios, respectively. Columns E through J show the true average $TE_i$ for subsets of treated, untreated, overlapped and treated, overlapped and untreated, nonoverlapped and treated, and nonoverlapped and untreated patients, respectively. These columns also show in parentheses the percentage of patients within each subset.
Patient-specific treatment effects ($TE_i$) do not influence treatment choice in simulation 1, and as a result, the average true $TE_i$ is close to the true population average treatment effect of 0.25 for both treated and untreated patients. Moving from simulation 2 through simulation 11, though, the knowledge of $TE_i$ used in decision making increases, and $TE_i$ explains a larger portion of the variation in treatment choice (column B). Under fully observed heterogeneity, all patients are overlapped in simulations 1 through 6. The percentage of overlapping patients falls from 97.0% to 68.8% in simulations 7 through 11. Under partially observed heterogeneity, all patients overlapped across all simulations. Columns E and F show how the greater influence of $TE_i$ on treatment choice leads to sorting on the gain. The average $TE_i$ for the treated patients in Column E increased from 0.250 to 0.329 as $K_j$ increased from 0 to 1, while the average $TE_i$ for the untreated patients in Column F fell from 0.251 to 0.172 across this range. Columns G through J stratify treated and untreated patients by overlap status under fully observed heterogeneity.
The average $TE_i$ of nonoverlapped treated patients (column I) is greater than that of overlapped treated patients (column G). Likewise, the average $TE_i$ of nonoverlapped untreated patients (column J) is less than that of overlapped untreated patients (column H). Column K of Table 2 shows the estimated treatment effect for the full population in each simulation using a linear probability model (LPM). A comparison of these estimates with column E confirms that the LPM yields estimates of the average treatment effect on the treated (ATT) [57]. When treatment effects are heterogeneous, LPM estimates appropriately generalize to untreated patients only when $TE_i$ does not influence treatment choice, as in simulation 1 [57].
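The reason the LPM coefficient tracks the ATT can be sketched with a short derivation. It rests on assumptions that are consistent with the simulation design but are stated here rather than quoted: the cure probability is linear, $\alpha_0 + \alpha_S S_i + TE_i T_i$, and severity $S_i$ and the other unmeasured outcome factors are independent of treatment choice, so the large-sample LPM coefficient on $T_i$, written $b_T$, equals the difference in mean outcomes between treated and untreated patients.

```latex
\begin{aligned}
b_T &= E[Y_i \mid T_i = 1] - E[Y_i \mid T_i = 0] \\
    &= E[\alpha_0 + \alpha_S S_i + TE_i \mid T_i = 1]
       - E[\alpha_0 + \alpha_S S_i \mid T_i = 0] \\
    &= E[TE_i \mid T_i = 1] = \mathrm{ATT}, \\[4pt]
\mathrm{ATE} &= P(T_i = 1)\,\mathrm{ATT} + P(T_i = 0)\,\mathrm{ATU}.
\end{aligned}
```

Here ATU denotes the average treatment effect among untreated patients. With the reported values for the simulation with $K_j = 1$ (ATT of 0.329, ATU of 0.172, population average of 0.25), the second identity implies a treated share of about 0.078 / 0.157, or roughly 50%, which agrees with the treatment percentage the simulations were calibrated to produce. When sorting on the gain makes the ATT exceed the ATU, a single LPM estimate therefore overstates the effect for untreated patients.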

CFA-GRF results under fully observed heterogeneity
Table 3 contains the average percentage differences between the true treatment effects and individual treatment effect estimates from CFA-GRF for each of the eleven simulated populations under fully observed heterogeneity. Estimates are reported for the full population in each simulation and for treatment-choice-based subsets. Table A.1 in Additional file 1 shows these results in terms of average absolute differences between the true treatment effect values and estimated treatment effects. The percentage differences in Table 3 were calculated using the average true treatment effect for each population subset found in Table 2 and the average absolute differences for each subset in Table A.1. For example, the average percentage difference between the estimated and true treatment effect values for the full population in simulation 1 under fully observed heterogeneity is 100*(-0.0014)/0.25 = -0.56%. Column E of Table 3 shows that, under fully observed heterogeneity, CFA-GRF produces treatment effect estimates that on average reflect each population across simulations. However, as treatment choice becomes more responsive to $TE_i$, CFA-GRF estimates increasingly understate the true treatment effect for treated patients and overstate the true treatment effect for untreated patients. Simulation 1 under fully observed heterogeneity fully satisfies ignorability, and CFA-GRF produces patient-specific treatment effect estimates that on average reflect the true patient treatment effects for the entire population and for both treated and untreated patient subsets. In contrast, in simulation 11, in which decision-makers have full knowledge of $TE_i$ in treatment choice, the treatment effect estimates for treated patients are on average 14.74% lower than the truth, and the estimated treatment effects for untreated patients are on average 30.99% higher than the truth. These percentage differences are not symmetric because untreated patients have a lower average true treatment effect. Columns G to J in simulations 6 through 11 demonstrate that these differences exist for both overlapping and nonoverlapping patients but are more pronounced for nonoverlapping patients.

CFA-GRF results under partially observed heterogeneity
Table 4 contains the average percentage differences between the true treatment effect values and CFA-GRF treatment effect estimates for each simulated population under partially observed heterogeneity. Under partially observed heterogeneity all patients are overlapped, so columns G through J found in Table 3 are unnecessary. Under ignorability in simulation 1, CFA-GRF again produces estimates that on average are close to true patient treatment effects for the entire population and for the treated and untreated patient subsets. In simulation 1, CFA-GRF estimates under partially observed heterogeneity had larger standard errors than those under fully observed heterogeneity (see Table A.2). Treatment effects estimated from CFA-GRF for treated patients closely reflect their true values across all eleven simulations. In contrast, CFA-GRF estimates for untreated patients are higher than their true values across simulations 2 through 11, with the differences increasing with the level of $TE_i$ influence on treatment choice. For example, based on the true average treatment effect for untreated patients from Table 2 and the average absolute differences for each population in Table A.1, CFA-GRF estimates for untreated patients are on average 2.4% greater than their true values in simulation 2 (100*(0.006)/(0.246)) and 76.3% greater than their true values in simulation 11 (100*(0.1312)/(0.172)). As a result, when $TE_i$ influences treatment choice under partially observed heterogeneity, CFA-GRF estimated treatment effects across the whole population are on average greater than their true values.
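The two percentages quoted above for untreated patients follow directly from the stated averages, as this small Python check shows. The function name is hypothetical; the inputs are the average absolute differences and average true effects quoted in the text.

```python
def percent_difference(avg_difference, avg_true_effect):
    """Average difference expressed as a percentage of the average true effect."""
    return 100.0 * avg_difference / avg_true_effect


sim2_untreated = percent_difference(0.006, 0.246)
sim11_untreated = percent_difference(0.1312, 0.172)
print(round(sim2_untreated, 1))    # 2.4
print(round(sim11_untreated, 1))   # 76.3
```

Part of the growth from 2.4% to 76.3% comes from the denominator: the average true effect among untreated patients shrinks from 0.246 to 0.172 as sorting on the gain strengthens, so the same absolute difference translates into a larger percentage.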

Discussion
Causal forest algorithms (CFAs) have been proposed to estimate patient-specific treatment effect evidence using observational data [23][24][25][26][27][28][29][30][31][32][33][107]. To apply CFAs, observational databases must contain patients with similar combinations of measured factors who were observed to make different treatment choices. The positive properties of CFAs for estimating patient-specific treatment effects have been established using simulation models under the assumption of ignorability [26][27][28][29][34][35][36]. Under ignorability, only the treatment variation from unobserved patient factors not associated with treatment effect heterogeneity is available to estimate patient-specific treatment effects. Therefore, it is unknown whether the positive properties of CFAs extend to real-world clinical applications in which patient factors affecting treatment effectiveness also influence treatment choice. In many real-world clinical scenarios it is likely that observed treatment choices reflect unmeasured patient factors related to expected treatment effectiveness for each patient, a condition defined in the econometric literature as essential heterogeneity [38,39,43,48,49,50,53]. This paper used simulations that varied only in the relationship between treatment effectiveness and treatment choice to assess the impact of essential heterogeneity on the ability of CFAs to estimate patient-specific treatment effects. The causal forest algorithm within the generalized random forests application (CFA-GRF) has been singled out as the most appropriate CFA for estimating patient-specific treatment effects and was used here [98]. To tease out the impacts of essential heterogeneity, CFA-GRF estimates were evaluated both in settings in which all patient factors associated with treatment effect heterogeneity were fully observed by the researcher and in settings in which they were not fully observed.
We replicated the positive properties of CFA-GRF in simulation scenarios under ignorability. Under ignorability, CFA-GRF yielded average population-wide estimates, and average estimates by patient subsets based on treatment choice, that were closely aligned with their true values whether heterogeneity was fully or partially observed within the algorithm. As a result, if researchers can make a strong conceptual case a priori that treatment effectiveness is unrelated to treatment choice, they can be confident that CFA-GRF can yield appropriate treatment effect estimates across a population of patients.
In simulation scenarios in which decision makers use patient factors associated with treatment effectiveness in making treatment decisions [38,39,43,48,49,50,53], the ability of CFA-GRF to identify patient-specific treatment effects varied with the influence that treatment effectiveness had on treatment choice and with whether the full range of patient factors associated with treatment effect heterogeneity was observed and specified in the algorithm. When all patient factors affecting treatment effect heterogeneity were fully specified, CFA-GRF produced treatment effect estimates that reflected true treatment effects across each population subset when the influence of treatment effectiveness on treatment choice was low. As this influence increased, however, treatment effect estimates showed increasingly negative bias for treated patients and positive bias for untreated patients. A substantial portion of this bias is likely attributable to nonoverlapping patients becoming a higher percentage of patients as the influence of treatment effectiveness on treatment choice increases. Under partially observed heterogeneity, all patients overlapped in all simulations. In this setting, CFA-GRF produced estimates that closely reflected the true treatment effect values for treated patients across all levels of influence of treatment effectiveness on treatment choice. In contrast, CFA-GRF estimates for untreated patients were biased high, with the extent of this bias increasing with the level of influence that treatment effectiveness had on treatment choice.
a The proportion of patient-specific $TE_i$ knowledge used by decision makers in simulation "j" in developing the expected treatment effect for patient "i" that is distinct from the population average treatment effect, based on the equation $ETE_i = K_j \cdot (TE_i(X_{1i}, X_{2i}, X_{3i}, X_{4i}, X_{5i}, X_{6i}) - .25) + .25$
b The percentage of treatment choice variation explained by $TE_i$ using a linear probability model of treatment choice $T_i$ on true $TE_i$, estimated with the SAS PROC REG procedure with the SCORR1 option
c Percentage of patients in sample with treatment propensity score greater than .05 and less than .95 when only the $X_{1i}$, $X_{2i}$, $X_{3i}$, $X_{4i}$ factors are specified in the propensity score equation
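The three summary quantities described in the table notes above (the expected treatment effect $ETE_i$, the share of treatment choice variation explained by $TE_i$, and the propensity score overlap percentage) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' SAS code: the function names and example values are hypothetical, and the squared Pearson correlation is used as a stand-in for the one-regressor linear probability model, on the assumption that the two coincide when $TE_i$ is the only regressor.

```python
import numpy as np

POPULATION_AVERAGE_TE = 0.25  # population average treatment effect stated in the text


def expected_treatment_effect(te, k):
    """ETE_i = K_j * (TE_i - .25) + .25, where k plays the role of K_j."""
    te = np.asarray(te, dtype=float)
    return k * (te - POPULATION_AVERAGE_TE) + POPULATION_AVERAGE_TE


def share_of_choice_explained(treated, te):
    """R^2 of a one-regressor linear probability model of T_i on TE_i.

    With a single regressor this equals the squared Pearson correlation.
    """
    t = np.asarray(treated, dtype=float)
    e = np.asarray(te, dtype=float)
    r = np.corrcoef(t, e)[0, 1]
    return float(r ** 2)


def overlap_share(propensity, lower=0.05, upper=0.95):
    """Fraction of patients whose propensity score lies strictly inside (lower, upper)."""
    p = np.asarray(propensity, dtype=float)
    return float(np.mean((p > lower) & (p < upper)))


# Illustrative values only (not taken from the paper's simulated populations).
te_example = np.array([0.05, 0.25, 0.45])
for k in (0.0, 0.5, 1.0):
    # k = 0: every patient is assigned the population average; k = 1: the true effect.
    print(k, expected_treatment_effect(te_example, k))
print(overlap_share([0.01, 0.50, 0.97, 0.20]))
```

With `k = 0` decision makers act only on the population average, which corresponds to the ignorability scenarios; as `k` approaches 1 they act on each patient's true effect.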

As a result, CFA-GRF estimates of patient-specific treatment effects using observational data must be assessed through the prism of the assumed reasons why patients with similar measured factors in a real-world setting were observed making different treatment choices. This requires researchers to explicitly develop conceptual frameworks of treatment choice that support these assumptions a priori, to ensure proper interpretation of treatment effect estimates ex post. The call for treatment choice conceptual frameworks to guide treatment effectiveness research using observational data has long been made in economics [44,48,49,108,109,110], and the importance of these frameworks is now being more widely appreciated [21,111,112]. A conceptual framework of treatment choice should describe the factors thought to influence treatment choice, the relationship of these factors to treatment effectiveness, and whether these factors are measured within the available data. Given the study findings, researchers should qualify patient-specific estimates from CFA-GRF in clinical scenarios in which essential heterogeneity likely exists. In these scenarios researchers should state that patient-specific estimates from CFA-GRF are likely biased high for the average patient with a given combination of measured patient factors and are best aligned to those patients a provider is more likely to treat.
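A toy simulation can make this interpretation concrete. The sketch below does not run CFA-GRF and is not the paper's simulation model: it assumes a single group of patients with identical measured factors, a hypothetical unmeasured factor `u` that shifts each patient's true effect around the population average of .25, and a hypothetical choice rule in which a patient is treated when the expected treatment effect plus idiosyncratic noise exceeds .25. It only shows how far the true effects of treated and untreated patients drift apart as `k` (the analogue of $K_j$) rises, which is the gap that an estimate aligned to treated patients would overstate for untreated patients.

```python
import numpy as np


def mean_true_effect_by_choice(k, n=200_000, seed=0):
    """Return (mean true TE among treated, mean true TE among untreated).

    All distributions and scales below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    u = rng.normal(0.0, 0.10, size=n)      # unmeasured factor driving heterogeneity
    te = 0.25 + u                          # true effect; population average is .25
    ete = k * (te - 0.25) + 0.25           # expected effect used by decision makers
    noise = rng.normal(0.0, 0.10, size=n)  # choice variation unrelated to TE
    treated = ete + noise > 0.25           # hypothetical treatment choice rule
    return float(te[treated].mean()), float(te[~treated].mean())


for k in (0.0, 0.5, 1.0):
    treated_mean, untreated_mean = mean_true_effect_by_choice(k)
    print(f"k={k:.1f}  treated={treated_mean:.3f}  untreated={untreated_mean:.3f}")
```

When `k = 0` the two group means should be nearly identical; as `k` increases, treated patients have systematically larger true effects than untreated patients who share the same measured factors.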
This study is limited by its use of only one of the several CFAs available to produce patient-specific evidence using observational data. While CFA-GRF was singled out as most appropriate for estimating patient-specific treatment effects [98], it is possible that other CFAs can incorporate and correct for the conditions associated with treatment choice when making treatment effect estimates. To this end, the simulated datasets produced here are available from the authors for use by other CFA developers to assess how the influence of treatment effect heterogeneity on treatment choice affects treatment effect estimates. In addition, the simulation approach in this paper is reported fully, is straightforward to reproduce, and is easy to modify, so researchers can assess the robustness of our results to parameter changes.

Conclusion
The acknowledged breadth of treatment effect heterogeneity across patients heightens the need for empirical approaches that yield patient-specific treatment effect evidence [4][5][6][7][8][9][10]. Causal forest algorithms (CFAs) have been proposed to analyze the treatment variation found within large observational databases to develop patient-specific evidence [23][24][25][26][27][28][29][30][31][32][33]. The simulation results in this paper show that the patient-specific estimates produced by a CFA are sensitive to the reasons why patients with the same set of measured factors were observed to make different treatment choices. It is likely in many real-world clinical scenarios that decision makers are cognizant of how patient factors affect treatment effectiveness and use this information in making treatment decisions [38,39,43,48,49,50,53]. Moreover, many real-world decision makers may know more about the patient factors affecting treatment effectiveness than the researchers who collect measures for research [22,113,114]. As a result, it is essential that researchers using CFAs to estimate patient-specific evidence from observational data build conceptual frameworks of treatment choice prior to estimation to guide estimate interpretation ex post.

Fig. 1 Fig. 2
Fig. 1 Directed Acyclic Graph (DAG) Describing the Conceptual Framework for the Simulation Model in which Patient Factors Affecting Treatment Effectiveness Affect Treatment Choice through Decision Maker Knowledge
a The proportion of patient-specific $TE_i$ knowledge used by decision makers in simulation "j" in developing the expected treatment effect for patient "i" that is distinct from the population average treatment effect, based on the equation $ETE_i = K_j \cdot (TE_i(X_{1i}, X_{2i}, X_{3i}, X_{4i}, X_{5i}, X_{6i}) - .25) + .25$. The population average treatment effect is .25 in all simulations
b The percentage of treatment choice variation explained by $TE_i$ using a linear probability model of treatment choice $T_i$ on true $TE_i$, estimated with the SAS PROC REG procedure with the SCORR1 option
c Percentage of patients in sample with treatment propensity score greater than .05 and less than .95 when all six patient factors are fully specified in the propensity score equation
d Percentage of patients in sample with treatment propensity score greater than .05 and less than .95 when only the $X_{1i}$, $X_{2i}$, $X_{3i}$, $X_{4i}$ factors are specified in the propensity score equation

Table 2
Summary information for simulated populations

Table 3
Average Percentage Differences Between the Estimated Treatment Effects and True Treatment Effects from the Causal Forest Algorithm within the Generalized Random Forests Application (CFA-GRF) Under Fully Observed Heterogeneity Across Simulated Populations Which Differ by the Extent That Treatment Effect Influences Treatment Choice
Percentage of patients in sample with treatment propensity score greater than .05 and less than .95 when all six patient factors are fully specified in the propensity score equation

Table 4
Average Percentage Differences Between the Estimated Treatment Effects and True Treatment Effects from the Causal Forest Algorithm within the Generalized Random Forests Application (CFA-GRF) Under Partially Observed Heterogeneity Across Simulated Populations Which Differ by the Extent That Treatment Effect Influences Treatment Choice